Dependency vs. Constituent Based Syntactic N-Grams in Text Similarity Measures for Paraphrase Recognition

Authors

  • Hiram Calvo
  • Andrea Segura-Olivares
  • Alejandro García
Abstract

Paraphrase recognition consists of detecting whether an expression restated as another expression contains the same information. Traditionally, several lexical, syntactic, and semantic techniques are used to solve this problem. Most works use n-grams to measure word overlap; syntactic n-grams, however, have been scarcely explored. We propose using syntactic dependency and constituent n-grams combined with common NLP techniques such as stemming, synonym detection, similarity measures, and linear combination, together with a similarity matrix built in turn from syntactic n-grams. We measure and compare the performance of our system using the Microsoft Research Paraphrase Corpus. We present an in-depth analysis of the strengths and weaknesses of each approach, as well as an analysis of common errors. Our main motivation was to determine which syntactic approach performs better for this task: syntactic dependency n-grams or syntactic constituent n-grams. We also compare both approaches with traditional n-grams and state-of-the-art systems.
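
To make the contrast between traditional and syntactic n-grams concrete, the following is a minimal sketch, not the authors' system. It assumes hand-written dependency parses (a real pipeline would obtain them from a parser), treats a syntactic dependency n-gram as a head-to-dependent path of n words, and uses Jaccard overlap as a stand-in similarity measure. The function names and the example sentences are hypothetical; stemming, synonym detection, linear combination, and the similarity matrix mentioned in the abstract are not reproduced.

```python
from typing import Dict, List, Set, Tuple

# A parse is (words, heads): heads[i] is the index of the head of words[i],
# and -1 marks the root. These parses are written by hand for illustration.
Parse = Tuple[List[str], List[int]]


def word_ngrams(words: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Traditional n-grams: runs of n consecutive surface words."""
    lowered = [w.lower() for w in words]
    return {tuple(lowered[i:i + n]) for i in range(len(lowered) - n + 1)}


def dependency_ngrams(parse: Parse, n: int) -> Set[Tuple[str, ...]]:
    """Syntactic n-grams: head-to-dependent paths of exactly n words."""
    words, heads = parse
    children: Dict[int, List[int]] = {i: [] for i in range(len(words))}
    for dep, head in enumerate(heads):
        if head >= 0:
            children[head].append(dep)

    ngrams: Set[Tuple[str, ...]] = set()

    def extend(path: List[int]) -> None:
        # Follow dependency arcs downward until the path holds n words.
        if len(path) == n:
            ngrams.add(tuple(words[i].lower() for i in path))
            return
        for child in children[path[-1]]:
            extend(path + [child])

    for start in range(len(words)):
        extend([start])
    return ngrams


def jaccard(a: Set, b: Set) -> float:
    """Jaccard overlap of two sets (assumed measure, not from the paper)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# "The cat chased the mouse" vs. "The cat quickly chased the mouse"
s1: Parse = (["The", "cat", "chased", "the", "mouse"], [1, 2, -1, 4, 2])
s2: Parse = (["The", "cat", "quickly", "chased", "the", "mouse"],
             [1, 3, 3, -1, 5, 3])

linear_sim = jaccard(word_ngrams(s1[0], 2), word_ngrams(s2[0], 2))
syntactic_sim = jaccard(dependency_ngrams(s1, 2), dependency_ngrams(s2, 2))
print(linear_sim, syntactic_sim)  # 0.5 0.8
```

In this toy pair, the inserted adverb breaks the surface bigram "cat chased" but leaves the dependency arc from "chased" to "cat" intact, so the syntactic bigram overlap is higher than the linear one.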

Similar resources

Baselines for Natural Language Processing Tasks Based on Soft Cardinality Spectra

Soft-cardinality spectra (SC spectra) is a new method of approximation for text strings in linear time, which divides text strings into character q-grams of different sizes. The method allows simultaneous use of weighting at term and q-gram levels. SC spectra in combination with resemblance coefficients allows the construction of a family of text similarity functions that only use the surface i...

PPDB: The Paraphrase Database

We present the 1.0 release of our paraphrase database, PPDB. Its English portion, PPDB:Eng, contains over 220 million paraphrase pairs, consisting of 73 million phrasal and 8 million lexical paraphrases, as well as 140 million paraphrase patterns, which capture many meaning-preserving syntactic transformations. The paraphrases are extracted from bilingual parallel corpora totaling over 100 mill...

TextFlow: A Text Similarity Measure based on Continuous Sequences

Text similarity measures are used in multiple tasks such as plagiarism detection, information ranking and recognition of paraphrases and textual entailment. While recent advances in deep learning highlighted further the relevance of sequential models in natural language generation, existing similarity measures do not fully exploit the sequential nature of language. Examples of such similarity m...

A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features

Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three methods have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and hybrid. Machine-learning-based methods perform well on Persian if they are trained with good features. To get good performanc...

Web-Scale Features for Full-Scale Parsing

Counts from large corpora (like the web) can be powerful syntactic cues. Past work has used web counts to help resolve isolated ambiguities, such as binary noun-verb PP attachments and noun compound bracketings. In this work, we first present a method for generating web count features that address the full range of syntactic attachments. These features encode both surface evidence of lexical af...

Journal title:
  • Computación y Sistemas

Volume 18  Issue 

Pages  -

Publication date 2014